Zillow has realized that its housing market predictions are not as accurate as they could be because they do not factor in enough local intelligence. Our group has been selected to build a better predictive model of home prices for San Francisco. Because so many variables influence home value in San Francisco, prediction is a difficult task. Even though gathering and processing the raw data takes substantial time and effort, the challenge is an interesting one to take on.
In this project, we bring geospatial analysis and machine learning techniques together to build our model. The goal of geospatial prediction is to borrow the experience from one place and test the extent to which that experience generalizes to another place. The machine learning process is divided into four steps, each discussed briefly below.
Data wrangling: The first step of the process is to gather and compile the appropriate data, often from multiple disparate sources, into one consistent dataset. This means the analyst has to understand the nature of the data, how it was collected, and how to massage the raw data into useful predictive features.
Exploratory analysis: Thoughtful exploratory analysis often leads to more useful predictive models. It also makes the analysis more interpretable for non-technical clients.
Feature engineering: Feature engineering is what separates a great model from a good one. Features are the variables used to predict the outcome of interest - in this case home prices. The more the analyst can convert raw data into useful features, the better her model will perform. There are two key considerations for doing this well.
Feature selection: In a geospatial prediction project, the dataset may contain hundreds of candidate features. It is much wiser to select the useful ones than to include them all. Feature selection is the process of whittling down all the possible features into a concise and parsimonious set that optimizes for accuracy and generalizability.
Model estimation and validation: In this project, we develop a home price prediction algorithm using linear regression models. To avoid bias and inaccuracy, we apply several methods in the following report to validate the machine learning model for accuracy and generalizability.
Overall, the home value model we built explains roughly 85 percent of the variation in prices for homes sold from 2012 to 2015, with an average error of about 180,000 dollars. Given how difficult home price prediction is, we consider this a well-performing model.
Our first geospatial machine learning model will be trained on home price data from San Francisco - one of the largest cities in the United States, featuring some of the nation's best travel destinations and a very robust technology sector. In this section, we create a `mapTheme`, read the data, and re-project it to State Plane (EPSG 2227), a coordinate system measured in feet.
Next, we loaded the shapefile `sp_original`, which contains 10,131 house sale points from 2012 to 2015. A map of San Francisco house prices using this dataset is presented below.
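The loading and re-projection steps might look like the following sketch with the `sf` package; the file path is a hypothetical placeholder.

```r
library(sf)

# Read the home sale points (hypothetical path) and re-project them to
# California State Plane Zone III (EPSG 2227), measured in feet
sp_original <- st_read("data/sf_home_sales.shp")
sp_original <- st_transform(sp_original, crs = 2227)
```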
The dataset above also includes various characteristics of San Francisco homes that are useful for model construction, including:
`houseAge` (= 2019 - BuiltYear)

Census tract data `tracts` is downloaded using the `tigris` package, which provides powerful access to US Census Bureau data. A couple of tracts are removed from the original dataset because they fall outside the study area. Both datasets are re-projected to the appropriate State Plane.
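A sketch of the tract download is below; the year and caching option are our assumptions.

```r
library(tigris)
library(sf)

options(tigris_use_cache = TRUE)  # cache downloads locally (assumed setting)

# Census tract geometries for San Francisco County, re-projected to EPSG 2227
tracts <- tracts(state = "CA", county = "San Francisco", year = 2017)
tracts <- st_transform(tracts, crs = 2227)
```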
Various datasets are downloaded from San Francisco Open Data, which hosts data on the city's economic, environmental, social, and many other characteristics. From this website, we have downloaded:
Census data is downloaded through the `tidycensus` package, which provides access to both the decennial and the ACS census data. Here, we work with the 2017 ACS 5-year estimates at the census tract level, using both the ACS profile and the detailed tables. Variables selected from the census include:
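A sketch of this step is below. The variable IDs shown are real ACS codes but only an illustrative subset of what we used, and a Census API key is required.

```r
library(tidycensus)
# census_api_key("YOUR_KEY", install = TRUE)  # one-time setup

# 2017 ACS 5-year estimates at the tract level for San Francisco;
# the three variables here are examples, not our full list
census <- get_acs(geography = "tract",
                  variables = c(med_hhincome = "B19013_001",  # median household income
                                totalpop     = "B01003_001",  # total population
                                medianrent   = "B25064_001"), # median gross rent
                  state = "CA", county = "San Francisco",
                  year = 2017, survey = "acs5", output = "wide")
```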
Finally, the locations of the Presidio, Twin Peaks, the Golden Gate, and the University of San Francisco are manually combined into one dataset.
We have categorized our predictors into three types: internal characteristics, amenities/public services, and spatial structure. Their summary statistics are presented below. Note that a few variables, including the neighborhoods and zonings, are absent from the tables because they are spatial or categorical data.
A correlation matrix is produced to examine the relationships among the continuous variables. The non-continuous variables, including all the census data variables, are dropped. To further simplify the correlation plot, only one variable is kept for each amenity.
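A sketch of how such a matrix can be produced is below; the column selection is illustrative.

```r
library(dplyr)
library(sf)
library(corrplot)

# Drop geometry, keep a handful of continuous variables
# (one per amenity), and plot their pairwise correlations
num_vars <- sp_original %>%
  st_drop_geometry() %>%
  select(SalePrice, PropArea, LotArea, houseAge,
         Financ_nn5, ind_nn3, tech_nn5)

corrplot(cor(num_vars, use = "pairwise.complete.obs"), method = "color")
```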
A series of explorations of the correlation between the dependent variable (sale price) and the other predictors is conducted using scatter plots. Here, the four best-fitting scatter plots are presented: `Fianc_nn5` is the distance to the 5 nearest financial services, `ind_nn3` is the distance to the 3 nearest attractions, `pcsamehouse` is the percentage of residents who have lived in the same house for over 1 year, and `tech_nn5` is the distance to the 5 nearest tech companies.
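These `_nn` features can be built by averaging the distance from each home to its k nearest amenities. Below is a sketch using `FNN::get.knnx`; the helper name and the `financial_services` layer are our own illustrative inputs.

```r
library(FNN)
library(sf)

# Average distance from each query point to its k nearest target points
nn_distance <- function(from_xy, to_xy, k) {
  nn <- get.knnx(data = to_xy, query = from_xy, k = k)
  rowMeans(nn$nn.dist)
}

# e.g., distance to the 5 nearest financial services (coordinates in feet);
# `financial_services` is a hypothetical sf layer of amenity points
sp_original$Financ_nn5 <- nn_distance(st_coordinates(sp_original),
                                      st_coordinates(financial_services), 5)
```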
Three of the independent variables are mapped below. For the first two variables, both the density of the amenity and the distance to the 5 nearest amenities are mapped. For the third variable, elevation, both the point elevations and the contour lines are mapped.
In this project, we use Ordinary Least Squares (OLS) regression to estimate house prices. OLS finds the linear relationship between the dependent variable - in this case the house sale price - and the predictors, which are the variables described in the data collection section above. The feature engineering is done in the following steps.
In order to construct and choose the most effective features, we first ran an OLS regression on all the predictors to test the fit of the model. The output gives both the p-value and the coefficient for each variable. The p-value reflects the significance of the coefficient - that is, the probability of observing a coefficient this large if there were truly no relationship between the independent and dependent variables. Typically, we want the p-value to be smaller than 0.05. R-squared is also calculated; it measures the share of variance in the dependent variable that is explained by the predictors, so a larger R-squared indicates a better-fitting model.
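A sketch of this first regression is below; the formula lists only a few representative predictors, not our full specification.

```r
library(sf)

# Baseline OLS on the (log-transformed) sale price; the full model
# includes many more predictors than shown here
reg_all <- lm(log(SalePrice) ~ PropArea + LotArea + houseAge +
                SaleYr + Financ_nn5 + nbrhood,
              data = st_drop_geometry(sp_original))

summary(reg_all)  # coefficients, p-values, and R-squared
```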
We compare the p-values of the variables to check their significance in relation to house price. This suggests which variables to keep or change. After removing or adding variables, we run the OLS model again and compare the R-squared values. Multiple trials are done to select the best set of predictors.
A series of feature engineering steps is applied throughout the trial process. We log-transformed the sale price and the price-related predictors to make them better fit the linear model. We also converted several variables into categories, since continuous variables do not always have the most predictive power. For example, from a demand perspective, a potential mansion purchaser does not care about the exact number of rooms, so whether a house has 6 or 7 bedrooms may not influence the final price much.
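For instance, room counts can be binned along these lines; the cut points are illustrative approximations of the categories that appear in the regression table, and `Rooms` is the assumed name of the raw column.

```r
library(dplyr)

# Bin the continuous room count into the broad ranges used in the model
sp_original <- sp_original %>%
  mutate(Rooms_cat = cut(Rooms,
                         breaks = c(-Inf, 0, 3, 7, 9, 12),
                         labels = c("0", "1-3", "4-7", "7-9", "10-12")))
```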
We also used stepwise regression via the `stepAIC` function in the `MASS` package, which automatically searches for the set of predictors with the lowest AIC. However, since we perform more model validation in the next step, we used this function's result only as a reference.
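A sketch, continuing from the `reg_all` fit above:

```r
library(MASS)

# Stepwise search in both directions, minimizing AIC;
# we treat the selected formula only as a reference
step_fit <- stepAIC(reg_all, direction = "both", trace = FALSE)
summary(step_fit)
```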
It is critical that models generalize to data they have not seen before. The R-squared above measures error only on the data the model was trained on. Below, the data is split into training and test datasets; models are trained on the former and tested on the latter.
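One way to make such a split, sketched with `caret` (the seed is arbitrary):

```r
library(caret)

set.seed(825)  # arbitrary seed for reproducibility

# 60% training / 40% test split, stratified on sale price
inTrain  <- createDataPartition(y = sp_original$SalePrice,
                                p = 0.60, list = FALSE)
training <- sp_original[inTrain, ]
test     <- sp_original[-inTrain, ]
```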
As mentioned above, the total dataset is divided into `training` and `test` sets, representing 60% and 40% of the data respectively. The table below describes the in-sample (training set) model results. We include all the variables in the training set so that it is easy to see which variables have the most significant influence. The Estimate column gives the coefficient; the next three columns are the standard error, t-value, and p-value. Scrolling through the table makes it easier to compare the variables.
 | Estimate | Std. Error | t-value | p-value |
---|---|---|---|---|
(Intercept) | 4.2639562 | 0.6616621 | 6.4443106 | 0.0000000 |
PropClassCD | 0.5755506 | 0.1481561 | 3.8847586 | 0.0001036 |
PropClassCDA | 0.4618116 | 0.1559621 | 2.9610491 | 0.0030786 |
PropClassCF | 0.5390863 | 0.1514164 | 3.5602904 | 0.0003735 |
PropClassCLZ | 0.4236747 | 0.1603939 | 2.6414631 | 0.0082775 |
PropClassCOZ | 0.0765424 | 0.2709996 | 0.2824445 | 0.7776130 |
PropClassCTH | 0.3770257 | 0.1633152 | 2.3085764 | 0.0210029 |
PropClassCTIC | -0.1104511 | 0.1507787 | -0.7325379 | 0.4638705 |
PropClassCZ | 0.3901430 | 0.1485135 | 2.6269861 | 0.0086376 |
PropClassCZBM | -0.0865445 | 0.1647161 | -0.5254164 | 0.5993143 |
PropClassCZEU | 1.0460826 | 0.2574477 | 4.0632822 | 0.0000490 |
LotArea | 0.0000001 | 0.0000000 | 4.2285058 | 0.0000239 |
PropArea | 0.0001913 | 0.0000073 | 26.2462328 | 0.0000000 |
Stories | 0.0121555 | 0.0051416 | 2.3641616 | 0.0181044 |
Rooms_cat0 | 0.2359082 | 0.0392759 | 6.0064410 | 0.0000000 |
Rooms_cat1-3 | -0.0148150 | 0.0413008 | -0.3587109 | 0.7198247 |
Rooms_cat10-12 | 0.1495376 | 0.0329817 | 4.5339511 | 0.0000059 |
Rooms_cat4-7 | 0.1787997 | 0.0345255 | 5.1787739 | 0.0000002 |
Rooms_cat7-9 | 0.1735513 | 0.0329259 | 5.2709730 | 0.0000001 |
Beds_cat1-2 | -0.0151342 | 0.0081737 | -1.8515878 | 0.0641370 |
Beds_cat3-5 | 0.0296502 | 0.0077134 | 3.8439851 | 0.0001224 |
Beds_cat6- | -0.0832438 | 0.0305800 | -2.7221647 | 0.0065054 |
Baths | 0.0089753 | 0.0045971 | 1.9523729 | 0.0509431 |
SaleYr | 0.1311513 | 0.0025949 | 50.5411144 | 0.0000000 |
crime_nn5 | -0.0000147 | 0.0000285 | -0.5174862 | 0.6048369 |
school_nn3 | -0.0000127 | 0.0000088 | -1.4452189 | 0.1484516 |
park_nn1 | -0.0000029 | 0.0000077 | -0.3721269 | 0.7098123 |
Distance_N | -0.0000058 | 0.0000055 | -1.0501808 | 0.2936797 |
Distance.y | -0.0000082 | 0.0000088 | -0.9266685 | 0.3541381 |
totaltree | 0.0000000 | 0.0000000 | 1.1994673 | 0.2303964 |
transit_cat100-150 | 0.0599836 | 0.0573878 | 1.0452324 | 0.2959601 |
transit_cat150-200 | 0.0861039 | 0.0557840 | 1.5435237 | 0.1227595 |
transit_cat200-250 | 0.0855872 | 0.0543222 | 1.5755482 | 0.1151857 |
transit_cat250-300 | 0.0875179 | 0.0537792 | 1.6273553 | 0.1037172 |
transit_cat300-350 | 0.0803336 | 0.0533225 | 1.5065603 | 0.1319791 |
transit_cat350-400 | 0.0995047 | 0.0536454 | 1.8548610 | 0.0636679 |
transit_cat400-450 | 0.0329856 | 0.0541271 | 0.6094109 | 0.5422765 |
transit_cat50-100 | 0.1033631 | 0.0643256 | 1.6068725 | 0.1081380 |
hght_cat<50 | 0.0465552 | 0.0743450 | 0.6262044 | 0.5312061 |
hght_cat>100 | 0.0020995 | 0.0785863 | 0.0267158 | 0.9786874 |
elev.cat1>700 | -0.0069740 | 0.0493066 | -0.1414417 | 0.8875260 |
elev.cat1100-150 | -0.0213546 | 0.0177310 | -1.2043673 | 0.2284979 |
elev.cat1150-200 | 0.0018289 | 0.0187736 | 0.0974185 | 0.9223975 |
elev.cat1200-250 | -0.0168260 | 0.0194673 | -0.8643183 | 0.3874496 |
elev.cat1250-300 | -0.0036226 | 0.0206547 | -0.1753887 | 0.8607805 |
elev.cat1300-350 | -0.0142960 | 0.0217078 | -0.6585648 | 0.5102020 |
elev.cat1350-400 | -0.0236293 | 0.0239669 | -0.9859120 | 0.3242183 |
elev.cat1400-450 | 0.0017987 | 0.0257546 | 0.0698416 | 0.9443222 |
elev.cat1450-500 | -0.0191671 | 0.0303781 | -0.6309509 | 0.5280980 |
elev.cat150~100 | -0.0215976 | 0.0151805 | -1.4227154 | 0.1548736 |
elev.cat1500-550 | -0.0462475 | 0.0344019 | -1.3443309 | 0.1788951 |
elev.cat1550-600 | -0.0199127 | 0.0385885 | -0.5160265 | 0.6058560 |
elev.cat1600-650 | 0.0093565 | 0.0405282 | 0.2308637 | 0.8174290 |
elev.cat1650-700 | -0.0673372 | 0.0462339 | -1.4564458 | 0.1453248 |
zoning_simRH-1 | -0.0104756 | 0.0152618 | -0.6863949 | 0.4924922 |
zoning_simRH-1(D) | 0.0073797 | 0.0216408 | 0.3410112 | 0.7331078 |
zoning_simRH-23 | -0.0056301 | 0.0135150 | -0.4165862 | 0.6769969 |
med_hhincome_ln | 0.0424431 | 0.0376544 | 1.1271756 | 0.2597159 |
pcWhite | -0.1031036 | 0.1075948 | -0.9582583 | 0.3379734 |
pcBlack | -0.4778166 | 0.1632112 | -2.9275970 | 0.0034295 |
pcAsian | -0.2010922 | 0.1152188 | -1.7453079 | 0.0809854 |
pcHispanic | -0.2882608 | 0.1115376 | -2.5844265 | 0.0097789 |
pcbachedegree | 0.0004485 | 0.0009339 | 0.4802262 | 0.6310851 |
houseAge | 0.0000693 | 0.0000153 | 4.5405581 | 0.0000057 |
pcunder18 | 0.0025540 | 0.0016514 | 1.5465598 | 0.1220252 |
pcabove65 | 0.0020981 | 0.0015027 | 1.3962367 | 0.1626978 |
pc2vehicles | -0.0026315 | 0.0008696 | -3.0260450 | 0.0024889 |
totalpop | 0.0000059 | 0.0000030 | 1.9608177 | 0.0499491 |
pc3ormorevehicles | -0.0020078 | 0.0011340 | -1.7704616 | 0.0767040 |
pcrenter | 0.0000642 | 0.0006946 | 0.0924928 | 0.9263098 |
pcdetached | 0.0003550 | 0.0005725 | 0.6200595 | 0.5352435 |
pcvacant | -0.0023714 | 0.0013304 | -1.7824454 | 0.0747301 |
pcsamehouse | -0.0006046 | 0.0011584 | -0.5219337 | 0.6017369 |
medianhomevalue_ln | 0.0534132 | 0.0403019 | 1.3253265 | 0.1851162 |
medianrent_ln | -0.0171124 | 0.0204715 | -0.8359132 | 0.4032389 |
water | -0.0000038 | 0.0000033 | -1.1610185 | 0.2456832 |
Pr_hlth_nn5 | 0.0000107 | 0.0000105 | 1.0151693 | 0.3100684 |
Financ_nn5 | -0.0000303 | 0.0000081 | -3.7224370 | 0.0001992 |
ind_nn3 | -0.0000013 | 0.0000033 | -0.3888641 | 0.6973912 |
nbrhoodAnza Vista | 0.0575581 | 0.0989783 | 0.5815227 | 0.5609113 |
nbrhoodBalboa Terrace | -0.3414999 | 0.0897596 | -3.8046055 | 0.0001435 |
nbrhoodBayview | -0.4190595 | 0.0853798 | -4.9081799 | 0.0000009 |
nbrhoodBayview Heights | -0.3519784 | 0.0922180 | -3.8168089 | 0.0001366 |
nbrhoodBernal Heights | -0.1443371 | 0.0727673 | -1.9835438 | 0.0473548 |
nbrhoodBuena Vista Park/Ashbury Heights | -0.0781552 | 0.0682499 | -1.1451333 | 0.2522022 |
nbrhoodCandlestick Point | 0.0188987 | 0.1155143 | 0.1636047 | 0.8700482 |
nbrhoodCentral Richmond | -0.2476063 | 0.0714027 | -3.4677434 | 0.0005287 |
nbrhoodCentral Sunset | -0.2857717 | 0.0731820 | -3.9049441 | 0.0000953 |
nbrhoodCentral Waterfront/Dogpatch | -0.0296298 | 0.1217184 | -0.2434289 | 0.8076819 |
nbrhoodClarendon Heights | -0.0140666 | 0.0842719 | -0.1669192 | 0.8674396 |
nbrhoodCole Valley/Parnassus Heights | -0.0698871 | 0.0704930 | -0.9914047 | 0.3215303 |
nbrhoodCorona Heights | -0.0431005 | 0.0706445 | -0.6101044 | 0.5418171 |
nbrhoodCow Hollow | 0.0770465 | 0.0739473 | 1.0419113 | 0.2974972 |
nbrhoodCrocker Amazon | -0.3186373 | 0.0849190 | -3.7522508 | 0.0001770 |
nbrhoodDiamond Heights | -0.2154483 | 0.0766903 | -2.8093279 | 0.0049815 |
nbrhoodDowntown | 0.0040233 | 0.1276963 | 0.0315070 | 0.9748663 |
nbrhoodDuboce Triangle | 0.0077086 | 0.0821412 | 0.0938451 | 0.9252355 |
nbrhoodEureka Valley / Dolores Heights | -0.0228854 | 0.0656139 | -0.3487884 | 0.7272611 |
nbrhoodExcelsior | -0.3493096 | 0.0807957 | -4.3233689 | 0.0000156 |
nbrhoodFinancial District/Barbary Coast | 0.0008044 | 0.1696964 | 0.0047400 | 0.9962182 |
nbrhoodForest Hill | -0.2138032 | 0.0789687 | -2.7074404 | 0.0068007 |
nbrhoodForest Hills Extension | -0.2356711 | 0.0839459 | -2.8074169 | 0.0050111 |
nbrhoodForest Knolls | -0.3230069 | 0.0822533 | -3.9269780 | 0.0000870 |
nbrhoodGlen Park | -0.1629115 | 0.0715477 | -2.2769636 | 0.0228255 |
nbrhoodGolden Gate Heights | -0.2383626 | 0.0766724 | -3.1088444 | 0.0018875 |
nbrhoodHaight Ashbury | -0.0071368 | 0.0740746 | -0.0963458 | 0.9232494 |
nbrhoodHayes Valley | -0.0292133 | 0.0675228 | -0.4326439 | 0.6652899 |
nbrhoodHunters Point | -0.4161678 | 0.1034729 | -4.0219975 | 0.0000585 |
nbrhoodIngleside | -0.3114883 | 0.0784659 | -3.9697294 | 0.0000728 |
nbrhoodIngleside Heights | -0.3169694 | 0.0818353 | -3.8732598 | 0.0001086 |
nbrhoodIngleside Terrace | -0.3457296 | 0.0841945 | -4.1063214 | 0.0000408 |
nbrhoodInner Mission | -0.1027365 | 0.0719483 | -1.4279201 | 0.1533699 |
nbrhoodInner Parkside | -0.2817343 | 0.0734866 | -3.8338199 | 0.0001275 |
nbrhoodInner Richmond | -0.1557548 | 0.0724680 | -2.1492916 | 0.0316534 |
nbrhoodInner Sunset | -0.2237886 | 0.0705252 | -3.1731716 | 0.0015159 |
nbrhoodJordan Park / Laurel Heights | -0.0455100 | 0.0762639 | -0.5967443 | 0.5507019 |
nbrhoodLake Shore | -0.2345527 | 0.0864963 | -2.7117077 | 0.0067139 |
nbrhoodLake Street | -0.1125872 | 0.0746021 | -1.5091701 | 0.1313110 |
nbrhoodLakeside | -0.3057071 | 0.1113176 | -2.7462609 | 0.0060469 |
nbrhoodLincoln Park | -0.0699205 | 0.1407764 | -0.4966777 | 0.6194356 |
nbrhoodLittle Hollywood | -0.1886448 | 0.1049682 | -1.7971606 | 0.0723633 |
nbrhoodLone Mountain | -0.1296185 | 0.0747212 | -1.7346960 | 0.0828490 |
nbrhoodLower Pacific Heights | -0.1179994 | 0.0697430 | -1.6919167 | 0.0907168 |
nbrhoodMarina | 0.0286487 | 0.0760877 | 0.3765217 | 0.7065431 |
nbrhoodMerced Heights | -0.2522350 | 0.0810730 | -3.1112076 | 0.0018725 |
nbrhoodMerced Manor | -0.3212896 | 0.0938382 | -3.4238668 | 0.0006217 |
nbrhoodMidtown Terrace | -0.2482746 | 0.0816748 | -3.0397935 | 0.0023782 |
nbrhoodMiraloma Park | -0.2631985 | 0.0773026 | -3.4047819 | 0.0006668 |
nbrhoodMission Bay | 0.4323335 | 0.2288954 | 1.8887821 | 0.0589720 |
nbrhoodMission Dolores | -0.0185553 | 0.0765687 | -0.2423348 | 0.8085295 |
nbrhoodMission Terrace | -0.3109898 | 0.0778739 | -3.9935057 | 0.0000659 |
nbrhoodMonterey Heights | -0.2483566 | 0.0876482 | -2.8335638 | 0.0046195 |
nbrhoodMount Davidson Manor | -0.3192811 | 0.0829064 | -3.8511031 | 0.0001189 |
nbrhoodNob Hill | 0.0943596 | 0.1023746 | 0.9217098 | 0.3567191 |
nbrhoodNoe Valley | -0.0380521 | 0.0665507 | -0.5717762 | 0.5674962 |
nbrhoodNorth Beach | 0.1127144 | 0.1156069 | 0.9749800 | 0.3296117 |
nbrhoodNorth Panhandle | -0.0508552 | 0.0687353 | -0.7398704 | 0.4594092 |
nbrhoodNorth Waterfront | 0.0618623 | 0.1681141 | 0.3679781 | 0.7129033 |
nbrhoodOceanview | -0.3301454 | 0.0820171 | -4.0253255 | 0.0000576 |
nbrhoodOuter Mission | -0.3500469 | 0.0825430 | -4.2407807 | 0.0000226 |
nbrhoodOuter Parkside | -0.2963253 | 0.0761693 | -3.8903536 | 0.0001012 |
nbrhoodOuter Richmond | -0.2682177 | 0.0749564 | -3.5783159 | 0.0003487 |
nbrhoodOuter Sunset | -0.2742651 | 0.0752594 | -3.6442667 | 0.0002706 |
nbrhoodPacific Heights | 0.1322949 | 0.0690104 | 1.9170302 | 0.0552843 |
nbrhoodParkside | -0.2863582 | 0.0744282 | -3.8474445 | 0.0001207 |
nbrhoodPine Lake Park | -0.3217252 | 0.0925085 | -3.4777922 | 0.0005094 |
nbrhoodPortola | -0.3293526 | 0.0834588 | -3.9462880 | 0.0000803 |
nbrhoodPotrero Hill | -0.1122644 | 0.0736244 | -1.5248256 | 0.1273583 |
nbrhoodPresidio Heights | 0.0164464 | 0.0785913 | 0.2092651 | 0.8342488 |
nbrhoodRussian Hill | 0.1100434 | 0.0784759 | 1.4022566 | 0.1608933 |
nbrhoodSaint Francis Wood | -0.1263398 | 0.0789991 | -1.5992562 | 0.1098194 |
nbrhoodSea Cliff | -0.0233098 | 0.0878560 | -0.2653179 | 0.7907742 |
nbrhoodSherwood Forest | -0.1980640 | 0.0974260 | -2.0329689 | 0.0421021 |
nbrhoodSilver Terrace | -0.3218188 | 0.0857195 | -3.7543226 | 0.0001755 |
nbrhoodSouth Beach | 0.1648346 | 0.1162110 | 1.4184086 | 0.1561264 |
nbrhoodSouth of Market | -0.0424642 | 0.0802069 | -0.5294325 | 0.5965262 |
nbrhoodStonestown | 0.0271216 | 0.1517653 | 0.1787076 | 0.8581737 |
nbrhoodSunnyside | -0.2499599 | 0.0733553 | -3.4075225 | 0.0006601 |
nbrhoodTelegraph Hill | 0.0409399 | 0.0875861 | 0.4674251 | 0.6402137 |
nbrhoodTenderloin | 0.5418892 | 0.3081128 | 1.7587362 | 0.0786762 |
nbrhoodTwin Peaks | -0.0780165 | 0.0807414 | -0.9662522 | 0.3339592 |
nbrhoodVan Ness/Civic Center | -0.1093575 | 0.1655360 | -0.6606269 | 0.5088784 |
nbrhoodVisitacion Valley | -0.3092418 | 0.0883157 | -3.5015479 | 0.0004661 |
nbrhoodWest Portal | -0.1892915 | 0.0786002 | -2.4082837 | 0.0160594 |
nbrhoodWestern Addition | -0.1499799 | 0.0863862 | -1.7361556 | 0.0825906 |
nbrhoodWestwood Highlands | -0.2498507 | 0.0872422 | -2.8638755 | 0.0042003 |
nbrhoodWestwood Park | -0.2045752 | 0.1110738 | -1.8417948 | 0.0655573 |
nbrhoodYerba Buena | 0.0444113 | 0.1848998 | 0.2401913 | 0.8101906 |
lagPrice_ln | -0.1988308 | 0.0164408 | -12.0937309 | 0.0000000 |
SalePrice.buffv_ln | 0.6296930 | 0.0171373 | 36.7439033 | 0.0000000 |
Sample | R-squared | Adjusted R-squared |
---|---|---|
Training | 0.8576808 | 0.8534425 |
When we evaluate the model on the test set, which holds the remaining 40% of the data, the predictions still show some inaccuracy. The graphic below compares the actual sale prices with the predicted sale prices.
Model | Mean Absolute Error ($) | Mean Absolute Percent Error (%) |
---|---|---|
test | 178002.6 | 15.74407 |
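These summary errors can be computed along these lines; `SalePrice.Predict` is a hypothetical name for the predicted-price column.

```r
library(dplyr)
library(sf)

# Absolute error and absolute percent error for each test-set sale
test_errors <- test %>%
  st_drop_geometry() %>%
  mutate(AbsError = abs(SalePrice.Predict - SalePrice),
         APE      = AbsError / SalePrice)

summarize(test_errors,
          mean_AbsError    = mean(AbsError, na.rm = TRUE),
          mean_APE_percent = 100 * mean(APE, na.rm = TRUE))
```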
Another diagram illustrates the distribution of prediction errors in the test set. The majority fall within the 0-200,000 dollar absolute error range, while some exceed 400,000 dollars.
Estimating a model on the training set and predicting for the test set is a good way to understand how well the model might generalize to homes that have not actually sold. However, to further assess generalizability, we perform cross-validation, which re-fits and evaluates the model across 100 folds of the data. Cross-validation allows one to judge generalizability not on one random hold-out but on many, helping to ensure that the goodness of fit on a single hold-out is not a fluke.
Source: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation
Below is the result of our cross-validation model. We did not use the log-transformed price in this model because it is difficult to transform predictions back to actual prices within cross-validation. In this case, the MAE is a little higher: about 204,000 dollars compared with 178,000 dollars before.
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 299118.1 0.8165478 203908.9 38288.06 0.04399277 18957.19
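Output like the above can be produced with a setup along the following lines; this is a sketch, and the formula shown differs from our final specification.

```r
library(caret)
library(sf)

# 100-fold cross-validation on the raw (untransformed) sale price
fitControl <- trainControl(method = "cv", number = 100)

cv_fit <- train(SalePrice ~ PropArea + LotArea + houseAge + SaleYr +
                  Financ_nn5 + nbrhood,
                data = st_drop_geometry(sp_original),
                method = "lm", trControl = fitControl,
                na.action = na.omit)

cv_fit$results  # RMSE, R-squared, MAE and their SDs across folds
```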
When we plot the predicted against the actual sale prices with a linear fit, it is obvious that some errors remain.
The two maps below illustrate the test set sale prices and the absolute sale price errors. Larger errors tend to occur in the central and northern parts of San Francisco, which are the higher-value housing areas.
The Moran's I statistic is a statistical hypothesis test that asks whether, and to what extent, a spatial phenomenon exhibits a given spatial process. In practice, the test looks at how local means deviate from the global mean. A positive Moran's I approaching 1 describes positive spatial autocorrelation, also known as clustering. Instances where high and low prices "repel" one another are said to be dispersed. Finally, where positive and negative values are randomly distributed, the Moran's I statistic is 0.
In our model, the Moran's I of the prediction errors is about 0.02, which is close to 0. In other words, our model's errors are essentially randomly distributed in space.
##
## Monte-Carlo simulation of Moran I
##
## data: filter(sp.test, !is.na(SalePrice.Error))$SalePrice.Error
## weights: spatialWeights.test
## number of simulations + 1: 1000
##
## statistic = 0.016313, observed rank = 944, p-value = 0.056
## alternative hypothesis: greater
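A sketch of how this test can be run with `spdep`, assuming spatial weights built from each sale's 5 nearest neighbors (the choice of k is our assumption):

```r
library(spdep)
library(dplyr)
library(sf)

# Drop sales with missing errors so values and weights line up
sp.test.clean <- filter(sp.test, !is.na(SalePrice.Error))

# Row-standardized spatial weights from the 5 nearest neighboring sales
coords <- st_coordinates(sp.test.clean)
spatialWeights.test <- nb2listw(knn2nb(knearneigh(coords, k = 5)),
                                style = "W")

# Monte-Carlo Moran's I with 999 permutations (simulations + 1 = 1000)
moran.mc(sp.test.clean$SalePrice.Error, spatialWeights.test, nsim = 999)
```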
Another map provides the predicted values for the entire dataset. Our model is fairly consistent across neighborhoods, so we can conclude that it is generalizable to some extent. However, larger errors occur when we predict for wealthier communities.
Finally, when we divide the data into income groups to test generalizability, we find the same result as discussed above: higher-income areas have lower prediction accuracy.
The MAPE is about 16%, meaning that on average our model deviates from the actual price by only 16%, which makes it quite effective for house price estimation.
Feature engineering and selection are very important steps in our model construction. There are three types of features in our models:
1. Spatial structure: One of the most important predictive features is the spatial lag variable, which takes the mean price of neighboring house sales within a 1/16-mile buffer of each sale point (see the sketch after this list). In this way, the model captures spatial autocorrelation very well. Including this predictor significantly improved the model's predictions.
2. Amenities: Other important variables include the distances to different types of services, including financial services and real estate leasing services.
3. Internal characteristics: Property area is the factor most strongly correlated with home price.
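A sketch of how the spatial lag can be computed with `sf` (1/16 mile = 330 feet; the data are in State Plane feet, and this implementation is one possible approach, not necessarily our exact code):

```r
library(sf)

buffer_ft <- 5280 / 16  # 1/16 mile, in feet

# Indices of all sales within the buffer distance of each sale
neighbors <- st_is_within_distance(sp_original, sp_original,
                                   dist = buffer_ft)

# Mean neighboring price, excluding the sale itself
sp_original$lagPrice <- sapply(seq_along(neighbors), function(i) {
  idx <- setdiff(neighbors[[i]], i)
  if (length(idx) == 0) NA else mean(sp_original$SalePrice[idx])
})
```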
The MAPE map demonstrates how well spatial autocorrelation is accounted for in our model. The darker blue regions, where house prices are higher, tend to have higher MAPE, meaning our model does not predict higher-priced homes well. In contrast, the model predicts regions with lower house prices much better, with a MAPE of about 7% - in other words, our predictions there deviate from the true prices by only 7%. We believe this spatial variation in MAPE arises because the prices of high-value homes fluctuate more and are influenced by more complicated factors.
We would highly recommend our model to Zillow, as it has a very small error in predicting house prices. Even though it is less representative in higher-value communities, the model still achieves strong MAE, MAPE, and R-squared results. As discussed above, the model can be improved in multiple ways; additionally, non-linear models other than OLS could be considered for a better fit.